MTBF versus Mean Time to Occurrence (MTTO)

MTBF estimates the time between hard failures in physical devices. It does not capture the incidence of intermittent soft failures. Even in sheltered environments (e.g., well-conditioned computer rooms), transient or intermittent system errors are 20 to 50 times more prevalent than hard failures. In an analysis of system error logs and interviews with field service personnel at Carnegie Mellon University (the results are graphed in Exhibit 1-6-3), system fault data was tabulated for a 13-server network providing mass storage for 5,000 nodes. The data represents 21 workstation years of fault history. Permanent faults (i.e., hard failures) showed a mean time to occurrence of 6,552 hours. Intermittent faults (i.e., a device-specific, weekly recurring fault pattern) showed an MTTO of 58 hours, and transient faults (i.e., random nonspecific faults) were reported at 354 hours MTTO. Overall, system crashes recorded an MTTO of 689 hours. Hard failures represented only 10% of system crashes.
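The MTTO figures above can be cross-checked with a little arithmetic. The following Python sketch converts each MTTO into an expected number of occurrences per year and estimates the share of crashes attributable to hard failures. It assumes continuous operation (8,760 hours per year), a constant fault rate, and that every permanent fault produces a crash. None of these assumptions is stated in the study, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes continuous operation

# MTTO figures quoted in the text, in hours.
MTTO_HOURS = {
    "permanent": 6552,
    "intermittent": 58,
    "transient": 354,
    "system_crash": 689,
}


def events_per_year(mtto_hours):
    """Expected occurrences per year, assuming a constant fault rate."""
    return HOURS_PER_YEAR / mtto_hours


def hard_failure_share_of_crashes(mtto):
    """Fraction of crashes caused by permanent faults.

    Assumes every permanent fault results in a crash, so the share is the
    ratio of the two rates: (1 / permanent) / (1 / system_crash).
    """
    return mtto["system_crash"] / mtto["permanent"]


rates = {kind: events_per_year(hours) for kind, hours in MTTO_HOURS.items()}
share = hard_failure_share_of_crashes(MTTO_HOURS)

for kind, per_year in rates.items():
    print(f"{kind:>12}: {per_year:7.1f} events per year")
print(f"hard-failure share of crashes: {share:.1%}")
```

Under these assumptions the hard-failure share works out to roughly 10.5%, which is consistent with the 10% figure quoted above.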
Further analysis revealed that only about 25% of the transient or intermittent faults actually caused a system crash and that 75% of these faults resulted in recoverable situations. The study's managers offer the following comments regarding system-level fault modeling:

The manifestations of intermittent and transient faults . . . are much harder to determine than permanent faults. . . . Because the fault is present only temporarily, and because most computer systems do not have online error detection, the normal manifestations of an intermittent fault are at the system level (such as system crash or I/O channel retry).

Transient faults and incorrect designs do not have a well-defined, bounded, basic fault model. Transient faults are a combination of local phenomena (such as ground loops, static discharges, power lines, and thermal distributions) and universal phenomena (such as cosmic rays, alpha particles, power supply characteristics, and mechanical design). Even if models could be developed for transient faults and incorrect designs, they would quickly become obsolete because of the rapid changes in technology.

Site Risk

Every field service manager or systems support manager with multiple-site responsibilities will relate that some percentage of their sites are afflicted with an above-average number of system faults. Some sites are chronic problem spots. Whatever method is used to estimate predicted system MTBF, the actual MTBF of a given electronic device will be improved or worsened by the quality of its electrical operating environment. The quality of the power distribution network in the operating environment of a proposed system or device can be evaluated before the system is installed. A power quality evaluation can be used to determine whether the site's electrical infrastructure will pose an above-average or below-average risk to the reliable operation of the electronic devices that make up the network.
Reliability needs assessments and site risk evaluations should be included as part of the general systems requirements assessment process. Once the value of reliability is determined and the site risk is evaluated, it is easier to design into the system a level of fault tolerance or fault intolerance (i.e., avoidance) that is cost-justified for the operational requirements.

MANAGING FOR MAXIMUM SYSTEM AVAILABILITY

Paying attention to the building power distribution network and enhancing it with power conditioning devices might be considered first- and second-stage fault avoidance techniques. Installation of a backup power system (i.e., a UPS) might be considered a third-stage technique. Bringing an intelligent UPS under the umbrella of a sophisticated network management utility could be considered a fourth-stage practice. Comparisons of specific methods and architectures are covered later in this chapter. The balance of this section is an overview of each of the stages.

Stage 1: Auditing Building AC Distribution

Building power systems must be safe for people as well as adequate for supporting the basic power requirements of network devices. All branch circuits must be installed and grounded in accordance with the National Electrical Code (NEC), sponsored by the National Fire Protection Association.

Integrity

A well-managed facility has procedures for the regular inspection of the building's branch circuit connections, including wall receptacles, junction boxes, and distribution panels. In practice, however, electrical system inspections are rare. Left uncorrected, loose connections pose a fire and safety hazard. They can also cause momentary interruptions and high-frequency electrical transients any time there is movement in the connection (e.g., when equipment is plugged in or power cords are bumped).

Capacity

Typical office building branch circuits are rated to carry either 15 amperes (A) or 20 A.
Periodic inspections should include at least a visual review of the regular loads on each circuit. Critical loads such as file servers should not have to compete with harsh or heavy loads such as copiers or laser printers. Loads for typical types of electronic office equipment are shown in Exhibit 1-6-4. Outlet types for 15 A and 20 A lines are shown in Exhibit 1-6-5.
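A load review of this kind reduces to adding up the steady-state current draws on a branch circuit and comparing the total against the circuit rating. The Python sketch below illustrates the idea. The equipment names and ampere values are hypothetical placeholders, not the figures from Exhibit 1-6-4, and the 80% headroom factor is an assumed planning margin for continuous loads rather than a value taken from this chapter.

```python
from dataclasses import dataclass


@dataclass
class Load:
    name: str
    amps: float  # steady-state current draw in amperes (hypothetical values)


def circuit_report(loads, breaker_amps, headroom=0.8):
    """Summarize the combined load on one branch circuit.

    breaker_amps is the circuit rating (15 or 20 in a typical office).
    headroom is an assumed planning margin, not a figure from the text.
    """
    total = sum(load.amps for load in loads)
    limit = breaker_amps * headroom
    return {
        "total_amps": total,
        "limit_amps": limit,
        "over_rating": total > breaker_amps,   # the breaker can be expected to trip
        "over_headroom": total > limit,        # little margin left on the circuit
    }


# Hypothetical mix of a critical load with heavy office loads on a 20 A circuit.
example_loads = [
    Load("file server", 3.0),
    Load("monitor", 1.0),
    Load("laser printer", 8.0),
    Load("copier", 10.0),
]
report = circuit_report(example_loads, breaker_amps=20)
print(report)
```

With these placeholder values the combined draw is 22 A on a 20 A circuit, so the circuit is flagged as over its rating. Moving the printer and copier to a separate circuit leaves the server and monitor at 4 A, well inside the margin.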
Heavy loads can push a circuit over its limit if the combined loads exceed the circuit's rating. If a computer system is on that circuit, it will crash when the breaker trips. For this reason, no more than two file servers should be placed on each branch circuit.

In some retail store electrical plants, which are engineered to support sophisticated point-of-sale networks, an entire sub-panel is dedicated to computer circuits. In these plants, computer branch circuits are at least physically segregated from lighting and other building loads, even if they are not electrically isolated from the noise created by those loads.

Several special wiring practices are supposed to improve computer system reliability by reducing electrical noise caused by ground connection effects. However, some of these practices are erroneous, and others are safe but fall short of the desired effect. In the final analysis, about all that can reasonably be expected from a building's power distribution infrastructure is that it satisfy the requirements of the National Electrical Code.
Use of this site is subject to certain Terms & Conditions. Copyright (c) 1996-1999 EarthWeb, Inc. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of EarthWeb is prohibited. Please read our privacy policy for details.